Prática 1: Visão Geral da Biblioteca Sklearn#
Biblioteca Scikit-Learn#
A biblioteca sklearn é uma das mais utilizadas atualmente, tanto na academia quanto no mercado. O que faz ela tão popular é ser escrita em Python (linguagem muito usada no mercado), possuir um grupo de desenvolvedores ativo e ser Open Source.
A seguir descreveremos as principais funcionalidades da biblioteca, bem como os principais algoritmos desenvolvidos.
Métodos de Classe Mais Usados#
fit (X, [y]): Constrói o modelo a partir do conjunto de treinamento. No caso de algoritmo não supervisonado o y pode ser omitido.
predict (X): Prediz a classe ou valor de regressão.
predict_proba (X): Previsão das probabilidades de classe das amostras de entrada X.
score (X, y[, sample_weight]): Retorna o score para o conjunto de dados:
Para classificação é acurácia
Para regressão é R2
set_params (params): Ajusta os hiperparâmetros do algoritmo.
get_params ([deep]): Retorna os hiperparâmetros do algoritmo.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons, make_blobs, make_regression, make_classification
from sklearn.model_selection import train_test_split
np.set_printoptions(suppress=True)
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)
preds = model.predict(X_test)
predsp = model.predict_proba(X_test)
scores = model.score(X_test, y_test)
print("Model: ", lrc)
print("Model hps: ", lrc.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: LogisticRegression(random_state=42)
Model hps: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score: 0.94
Model prediction:
[1 2 2 0 0 0 1 2 2 2]
Model prediction (proba):
[[0.00010604 0.99988101 0.00001294]
[0.00681164 0.00000441 0.99318395]
[0.00124705 0.00063736 0.99811559]
[0.9981574 0.00003683 0.00180576]
[0.99644246 0.00000272 0.00355483]
[0.99873481 0.00000168 0.00126351]
[0.00007422 0.99987744 0.00004834]
[0.00242916 0.00004639 0.99752445]
[0.39413349 0.0837987 0.52206781]
[0.00635346 0.00354697 0.99009957]]
transform: Aplica uma transformação nos dados
fit_transform: Faz fit e a seguir aplica a tranformação nos dados de entrada
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_test_ = scaler.transform(X_test)
X_test_[:10]
array([[ 1.08678365, -1.5164436 ],
[-1.55876784, 0.65593415],
[-1.31288369, -0.37880668],
[ 0.24177302, 1.46105908],
[-0.09670804, 1.81924233],
[ 0.05017578, 1.9255206 ],
[ 0.81429729, -1.62586094],
[-1.46497424, 0.12512164],
[-0.23033232, -0.0316933 ],
[-1.01038457, -0.33584257]])
X_train_ = scaler.fit_transform(X_train)
X_train_[:10]
array([[ 0.29038368, 1.12950646],
[-1.13999693, 0.59901388],
[ 0.01306258, 1.97048737],
[ 1.50049455, -2.55336023],
[ 0.37695627, -1.46288631],
[ 0.49285469, -1.82403643],
[ 0.6848445 , -0.01627551],
[-1.62026329, -0.38417894],
[ 0.97952963, 0.19779602],
[ 1.8370938 , -0.70870969]])
Algoritmos de Classificação#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Logistic Regression (LogisticRegression)#
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.94
k-Nearest Neighbors (KNeighborsClassifier)#
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
model = knn.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
Decision Tree (DecisionTreeClassifier)#
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=42)
model = dtc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666
Naive Bayes (GaussianNB)#
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333
Random Forest (RandomForestClassifier)#
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=42)
model = rfc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.94
Support Vector Machine (SVC)#
from sklearn.svm import SVC
svc = SVC(random_state=42)
model = svc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333
Multi-layer Perceptron classifier (MLPClassifier)#
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42)
model = mlp.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.92
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
warnings.warn(
Algoritmos de Regressão#
X, y = make_regression(
n_samples=500,
n_features=10,
n_informative=8,
noise=30,
random_state=1
)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=target)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Lasso Regression (Lasso)#
from sklearn.linear_model import Lasso
las = Lasso(random_state=42)
model = las.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-115.7708748 -46.31684795 -20.86665902 -158.62497355 382.25990628
-124.28940267 68.21063266 -164.95106788 -328.33404498 222.81955835]
0.9469180813122322
Ridge Regression (Ridge)#
from sklearn.linear_model import Ridge
rid = Ridge(random_state=42)
model = rid.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-122.78992165 -46.12780147 -21.94305706 -162.50410571 387.85141669
-125.93351061 71.01896076 -167.59242227 -334.32584704 225.75003396]
0.9468979075721192
ElasticNet Regression (ElasticNet)#
from sklearn.linear_model import ElasticNet
ent = ElasticNet(random_state=42)
model = ent.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ -71.19710668 -25.93729254 -15.55881416 -101.25000773 253.68386461
-78.39175189 42.01147833 -104.95807672 -219.87137993 160.11027644]
0.8339397652780292
k-Nearest Neighbors (KNeighborsRegressor)#
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor()
model = knr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ -99.6684154 -16.38105864 -50.66401373 -57.41325167 239.25981724
-128.72638105 83.64068093 -146.05781319 -255.81035766 152.25888883]
0.7061793685321537
Decision Tree (DecisionTreeRegressor)#
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state=42)
model = dtr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 55.72824398 -34.50897194 -28.79172463 -145.85559682 271.30919082
-70.81129949 171.02917036 -145.85559682 -325.1987638 274.33407544]
0.5107145300278733
Random Forest (RandomForestRegressor)#
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
model = rfr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 24.1702344 -42.91186327 -17.1736669 -139.23683542 218.26798239
-118.29363653 82.02520146 -138.82174793 -263.34837058 231.63570598]
0.8015615355305776
Support Vector Machine (SVR)#
from sklearn.svm import SVR
svr = SVR()
model = svr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 0.44496551 -2.50843853 1.67095038 -3.91897052 16.13692994
-7.41046195 6.61101378 -9.16400223 -11.47253449 19.69234491]
0.10988234690190346
Multi-layer Perceptron(MLPRegressor)#
from sklearn.neural_network import MLPRegressor
mlpr = MLPRegressor(random_state=42)
model = mlpr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-68.09169775 -8.98875781 -19.65923616 -40.95727195 158.65427625
-30.366409 39.17966508 -46.90013887 -99.71012589 106.20396741]
0.5630287207632332
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
warnings.warn(
Algoritmos de Agrupamento#
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
K-Means (KMeans)#
from sklearn.cluster import KMeans
kmeans = KMeans(random_state=42)
y_ = kmeans.fit_predict(X)
print(y_[:10])
[1 4 7 3 2 3 5 2 5 6]
Agglomerative Clustering (AgglomerativeClustering)#
from sklearn.cluster import AgglomerativeClustering
agc = AgglomerativeClustering()
y_ = agc.fit_predict(X)
print(y_[:10])
[0 0 0 1 0 1 0 0 0 0]
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)#
from sklearn.cluster import DBSCAN
dbs = DBSCAN()
y_ = dbs.fit_predict(X)
print(y_[:10])
[0 0 0 0 0 0 0 0 0 0]
Técnicas de Pré-Processamento & Redução de Dimensionalidade#
Existe uma vasta gama de técnicas de pré-processamento e redução de dimensionalidade no sklearn. Você pode encontrar mais informações sobre elas nos seguintes links:
A seguir destacaremos as mais comuns bem como um exemplo de uso com código:
Imputação para Completar Valores Ausentes (SimpleImputer)#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X[:25, 0] = np.nan
X[25:50, 1] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: np.isnan(x).sum(), [X_train, X_test, y_train, y_test]))
[32, 18, 0, 0]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)
X_train_ = imp_mean.transform(X_train)
X_test_ = imp_mean.transform(X_test)
print("Train nan [count]: ", np.isnan(X_train).sum())
print("Train nan [count]: ", np.isnan(X_test).sum())
print("Train transformed nan [count]: ", np.isnan(X_train_).sum())
print("Test transformed nan [count]: ", np.isnan(X_test_).sum())
Train nan [count]: 32
Train nan [count]: 18
Train transformed nan [count]: 0
Test transformed nan [count]: 0
Dimensiona as Features para um Determinado Intervalo (MinMaxScaler)#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: (np.min(x), np.max(x)), [X_train, X_test, y_train, y_test]))
[(-4.10970064619185, 6.560510824746599),
(-3.3511606691597353, 6.213852280547424),
(0, 2),
(0, 2)]
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
mm.fit(X_train)
X_train_ = mm.transform(X_train)
X_test_ = mm.transform(X_test)
print("X_train min/max", (np.min(X_train_), np.max(X_train_)))
print("X_test min/max", (np.min(X_test_), np.max(X_test_)))
X_train min/max (0.0, 1.0)
X_test min/max (0.04484147438192271, 0.960195322826444)
Seleção de Features (SelectKBest)#
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.feature_selection import SelectKBest
skb = SelectKBest(k=5)
skb.fit(X_train, y_train)
X_train_ = skb.transform(X_train)
X_test_ = skb.transform(X_test)
print("X_train shape: ", X_train_.shape)
print("X_test shape: ", X_test_.shape)
X_train shape: (350, 5)
X_test shape: (150, 5)
Principal Component Analysis (PCA)#
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X_train)
X_train_ = ss.transform(X_train)
X_test_ = ss.transform(X_test)
pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_)
X_train_pca = pca.transform(X_train_)
X_test_pca = pca.transform(X_test_)
print("X_train shape: ", X_train_pca.shape)
print("X_test shape: ", X_test_pca.shape)
X_train shape: (350, 5)
X_test shape: (150, 5)
Criando Pipelines#
Os pipelines são ferramentas especialmente úteis quando queremos aplicar diferentes técnicas antes da modelagem. Além disso, este pode nos ajudar a evitar escrever longos scripts de código.
O Pipeline aplica sequencialmente uma lista de transformações e um estimador final. Desta maneira, para as etapas intermediárias ele aplica o fit e transform automaticamente.
A seguir exemplificamos como o pipeline funciona.
Sem o uso do Pipeline#
Queremos fazer as seguintes tarefas:
Inputação de valores
Normalização dos dados (z-score)
Redução via PCA para 5 features
Modelagem por meio do Logistic Regression
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X[:25, 0] = np.nan
X[25:50, 1] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)
X_train_imp = imp_mean.transform(X_train)
X_test_imp = imp_mean.transform(X_test)
ss = StandardScaler()
ss.fit(X_train_imp)
X_train_ss = ss.transform(X_train_imp)
X_test_ss = ss.transform(X_test_imp)
pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_ss)
X_train_pca = pca.transform(X_train_ss)
X_test_pca = pca.transform(X_test_ss)
lr = LogisticRegression(random_state=42)
model = lr.fit(X_train_pca, y_train)
preds = model.predict(X_test_pca)
predsp = model.predict_proba(X_test_pca)
scores = model.score(X_test_pca, y_test)
print("Model: ", lr)
print("Model hps: ", lr.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: LogisticRegression(random_state=42)
Model hps: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score: 0.7733333333333333
Model prediction:
[1 1 0 0 1 0 0 1 1 0]
Model prediction (proba):
[[0.27950121 0.72049879]
[0.42193184 0.57806816]
[0.96734767 0.03265233]
[0.58694956 0.41305044]
[0.40789539 0.59210461]
[0.98215815 0.01784185]
[0.51741084 0.48258916]
[0.36402651 0.63597349]
[0.25303639 0.74696361]
[0.87368411 0.12631589]]
Com o uso do Pipeline#
Queremos fazer as mesmas tarefas anteriores:
Inputação de valores
Normalização dos dados (z-score)
Redução via PCA para 5 features
Modelagem por meio do Logistic Regression
from sklearn.pipeline import Pipeline
steps = [
('si', SimpleImputer(missing_values=np.nan, strategy="mean")),
('ss', StandardScaler()),
('pca', PCA(n_components=5, random_state=42)),
('lrc', LogisticRegression(random_state=42))
]
pipe = Pipeline(steps)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
predsp = pipe.predict_proba(X_test)
scores = pipe.score(X_test, y_test)
print("Model: ", pipe)
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: Pipeline(steps=[('si', SimpleImputer()), ('ss', StandardScaler()),
('pca', PCA(n_components=5, random_state=42)),
('lrc', LogisticRegression(random_state=42))])
Model Score: 0.7733333333333333
Model prediction:
[1 1 0 0 1 0 0 1 1 0]
Model prediction (proba):
[[0.27950121 0.72049879]
[0.42193184 0.57806816]
[0.96734767 0.03265233]
[0.58694956 0.41305044]
[0.40789539 0.59210461]
[0.98215815 0.01784185]
[0.51741084 0.48258916]
[0.36402651 0.63597349]
[0.25303639 0.74696361]
[0.87368411 0.12631589]]
Seleção de Modelos#
Métricas e pontuações: quantificando a qualidade das previsões (Evaluation)#
from sklearn.metrics import balanced_accuracy_score
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]
balanced_accuracy_score(y_true, y_pred)
0.625
Cross-validation: Avalição da Performance de Modelos (CV)#
Divisão em Treinamento e Teste#
from sklearn.model_selection import train_test_split
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
[i.shape for i in [X, y]]
[(150, 4), (150,)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
[i.shape for i in [X_train, X_test, y_train, y_test]]
[(105, 4), (45, 4), (105,), (45,)]
Cross Validation#
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="balanced_accuracy")
scores
array([0.91666667, 1. , 1. , 0.75 , 0.83333333,
1. , 1. , 0.80555556, 1. , 0.91666667])
Tuning the hyper-parameters of an estimator (HPT)#
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
X, y = make_moons()
tune_parameters = {
"criterion": ["gini", "entropy"],
"max_depth": np.linspace(1, 32, 32, endpoint=True),
"min_samples_leaf": np.linspace(0.01, 0.1, 5, endpoint=True)
}
clf = GridSearchCV(
estimator=DecisionTreeClassifier(),
param_grid=tune_parameters,
scoring="balanced_accuracy",
cv=10,
refit=True
)
clf.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
27., 28., 29., 30., 31., 32.]),
'min_samples_leaf': array([0.01 , 0.0325, 0.055 , 0.0775, 0.1 ])},
scoring='balanced_accuracy')
clf.score(X_test, y_test)
1.0
Salvando e Carregando modelos#
É possível salvar um modelo do scikit-learn usando o protocolo de serialização e de-serialização de objetos em Python chamado pickle:
Salvando um modelo#
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
import pickle
model_name = "my_model.pkl"
with open(model_name, 'wb') as file:
pickle.dump(model, file)
Carregando um Modelo#
import pickle
model_name = "my_model.pkl"
with open(model_name, 'rb') as file:
pickled_model = pickle.load(file)
preds = pickled_model.predict(X_test)
predsp = pickled_model.predict_proba(X_test)
scores = pickled_model.score(X_test, y_test)
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model Score: 1.0
Model prediction:
[0 2 2 2 2 1 1 2 0 1]
Model prediction (proba):
[[0.99985755 0.0000996 0.00004284]
[0.00001745 0.00016558 0.99981698]
[0.00004946 0.00008208 0.99986846]
[0.00004331 0.0001275 0.99982919]
[0.00002165 0.00026253 0.99971582]
[0.00232173 0.99603898 0.00163929]
[0.00024187 0.99971246 0.00004567]
[0.00038217 0.00214574 0.99747209]
[0.99954522 0.00037273 0.00008205]
[0.00190224 0.99772551 0.00037225]]
Referências & Links#
Prática 2: Tarefas de Aprendizado de Máquina#
Imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_blobs, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import AgglomerativeClustering
from yellowbrick.contrib.classifier import DecisionViz
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import PredictionError
# plt.rcParams["figure.figsize"] = (10,10)
np.set_printoptions(suppress=True)
Tarefas de AM#
Existem diversos tipos de Tarefas de Aprendizado de Máquina. A figura a seguir demonstra os mais importantes:
Dentre as mais importantes destacamos a classificação, a regressão e a clusterização.
Entender qual a tarefa mais adequada para um problema é uma das tarefas do Cientista de Dados.
Aprendizado Supervisionado#
É a tarefa que aprende uma função que mapeia uma entrada a uma saída por meio de pares de exemplos de entrada e saída.
Classificação#
Na classificação a saída é discreta, por exemplo:
Se vai chover ou fazer sol;
Se o diagnóstico para uma doença é positivo ou negativo;
Se houve ou não fraude em uma dada transferência bancária;
Se o cliente vai preferir o produto A, B ou C.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target);
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
clf = LogisticRegression(random_state=42)
viz = DecisionViz(
clf,
title="Decision Tree",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf = DecisionTreeClassifier(
criterion="entropy",
max_depth=5,
min_samples_split=10,
random_state=42
)
viz = DecisionViz(
clf,
title="Decision Tree",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf = KNeighborsClassifier(
n_neighbors=5,
metric="euclidean"
)
viz = DecisionViz(
clf,
title="Nearest Neighbors",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
Regressão#
Na regressão a saída é contínua, por exemplo:
O valor de uma ação na semana seguinte;
A temperatura do dia seguinte;
A quantidade de produtos vendidos em uma campanha;
O preço de um insumo nos próximos meses.
X, y = make_regression(
n_samples=500,
n_features=2,
n_informative=2,
noise=30,
random_state=1
)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)
sns.relplot(data=df, x=features[0], y=target)
sns.rugplot(data=df, x=features[0], y=target)
sns.relplot(data=df, x=features[1], y=target)
sns.rugplot(data=df, x=features[1], y=target)
<AxesSubplot:xlabel='feature-2', ylabel='target'>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
model = LinearRegression()
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and render the figure
<AxesSubplot:title={'center':'Prediction Error for LinearRegression'}, xlabel='$y$', ylabel='$\\hat{y}$'>
Aprendizado Não-Supervisionado#
Nenhum rótulo é previamente conhecido, desta maneira os dados de entrada são usados para descobrir padrões ocultos e agregá-los de alguma forma.
Agrupamento#
Exemplo:
Segmentação de clientes
Motor de recomendação
Social Network Analysis (SNA)
X, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=1)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
for c in [2, 3, 4]:
cluster = AgglomerativeClustering(n_clusters=c, affinity='euclidean', linkage='ward')
y_ = cluster.fit_predict(X)
sns.scatterplot(data=df, x=features[0], y=features[1], hue = y_)
plt.show()
© ealcobaca (2022)
Prática 3 - Case: Classificação de Câncer por meio de microRNA#
Descrição & Objetivo#
Descrição dos Dados: Os dados foram coletados do The Cancer Genome Atlas (TCGA), que é um programa internacional e de referência mundial de caracterização de mais de 33 tipos de câncer. Os dados são reais e foram devidamente anonimizados. Cada linha representa a amostra retirada de uma pessoa. As colunas são os tipos de microRNA e cada entrada representa a intensidade com que aquele microRNA está expresso. Os valores de expressão variam entre \([0, \infty]\). Valores próximos a zero indicam pouca expressão enquanto que o contrário indica uma alta expressão. Os dados também apresentam rótulos (veja o atributo class) sendo TP (primary solid tumor) indicando tumor e NT (normal tissue).
Objetivo: Construir um modelo para predizer quando uma pessoa tem câncer dado um exame de sequenciamento do RNA.
Leitura dos Dados#
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.
import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
# plt.rcParams["figure.figsize"] = (15,15)
# sns.set(rc={'figure.figsize':(150, 150)})
df = pd.read_csv("brca_mirnaseq.csv", sep=';', header=0, decimal=',')
df
| hsa.let.7a.1 | hsa.let.7a.2 | hsa.let.7a.3 | hsa.let.7b | hsa.let.7c | hsa.let.7d | hsa.let.7e | hsa.let.7f.1 | hsa.let.7f.2 | hsa.let.7g | ... | hsa.mir.941.1 | hsa.mir.942 | hsa.mir.943 | hsa.mir.944 | hsa.mir.95 | hsa.mir.96 | hsa.mir.98 | hsa.mir.99a | hsa.mir.99b | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8962.996542 | 17779.575039 | 9075.200383 | 24749.898857 | 341.298400 | 406.164781 | 1470.179650 | 14.716795 | 3627.642977 | 387.417272 | ... | 0.0 | 5.530515 | 0.187475 | 2.062226 | 4.124452 | 119.984057 | 53.992826 | 130.201449 | 46548.939810 | TP |
| 1 | 7739.739862 | 15524.941906 | 7713.626636 | 23374.640471 | 801.487258 | 513.297924 | 560.962427 | 20.922042 | 6557.093894 | 350.955461 | ... | 0.0 | 8.180047 | 0.000000 | 0.629234 | 1.258469 | 60.249189 | 86.047798 | 236.434808 | 12644.149725 | TP |
| 2 | 8260.612670 | 16497.981335 | 8355.342958 | 10957.355911 | 635.811272 | 620.351816 | 2694.331127 | 39.799878 | 11830.760394 | 600.725980 | ... | 0.0 | 3.618171 | 0.000000 | 0.767491 | 1.644623 | 97.252043 | 117.645369 | 191.434123 | 33083.456616 | TP |
| 3 | 9056.241254 | 18075.168478 | 9097.666150 | 26017.522731 | 2919.348415 | 334.245155 | 1322.434475 | 17.866463 | 6438.725384 | 354.957604 | ... | 0.0 | 3.478426 | 0.000000 | 3.478426 | 1.739213 | 72.572624 | 41.583007 | 1046.690127 | 24067.232290 | TP |
| 4 | 10897.303665 | 21822.338727 | 10963.956320 | 22204.253575 | 3313.009950 | 350.615669 | 1711.886682 | 22.541895 | 8246.117280 | 333.425447 | ... | 0.0 | 2.108235 | 0.000000 | 1.135203 | 0.810860 | 19.947145 | 34.380445 | 1081.037952 | 25715.275426 | TP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 837 | 10628.975280 | 21125.108661 | 10585.686678 | 23396.813364 | 3892.051211 | 367.141461 | 1484.663795 | 23.402901 | 10570.535667 | 571.680109 | ... | 0.0 | 1.217492 | 0.000000 | 0.405831 | 1.217492 | 79.813361 | 57.627952 | 1100.883277 | 16338.471420 | TP |
| 838 | 16799.785282 | 33603.904432 | 16883.338223 | 20731.006597 | 5263.331356 | 201.676038 | 2173.283559 | 36.888271 | 18227.341203 | 870.301142 | ... | 0.0 | 5.341744 | 0.000000 | 3.124416 | 2.318115 | 16.629958 | 57.348159 | 1919.601107 | 14080.736733 | TP |
| 839 | 13120.807001 | 26337.935723 | 13229.425112 | 18796.895124 | 6581.549565 | 375.598820 | 2547.029500 | 28.505268 | 16838.042944 | 778.398745 | ... | 0.0 | 1.863089 | 0.000000 | 0.558927 | 0.931545 | 41.919511 | 54.215901 | 1310.124456 | 17072.605898 | TP |
| 840 | 7979.531224 | 16006.280243 | 8106.687917 | 20462.010937 | 4040.296936 | 295.594442 | 962.166120 | 23.885025 | 7625.121634 | 428.411748 | ... | 0.0 | 2.070956 | 0.000000 | 2.209020 | 1.656765 | 55.225491 | 53.016472 | 1120.939408 | 18696.866174 | TP |
| 841 | 10439.110392 | 20880.967721 | 10649.126224 | 17770.685685 | 1330.766196 | 790.868182 | 1952.822603 | 29.966587 | 10936.555740 | 577.855691 | ... | 0.0 | 11.487192 | 0.000000 | 3.745823 | 2.746937 | 44.949881 | 80.160621 | 470.225698 | 34080.000799 | TP |
842 rows × 898 columns
df.shape
(842, 898)
Análise Exploratória dos Dados#
ax = sns.countplot(x="class", data=df)
df["class"].value_counts()
TP 755
NT 87
Name: class, dtype: int64
df["class"].value_counts(normalize=True)
TP 0.896675
NT 0.103325
Name: class, dtype: float64
df.describe()
| hsa.let.7a.1 | hsa.let.7a.2 | hsa.let.7a.3 | hsa.let.7b | hsa.let.7c | hsa.let.7d | hsa.let.7e | hsa.let.7f.1 | hsa.let.7f.2 | hsa.let.7g | ... | hsa.mir.940 | hsa.mir.941.1 | hsa.mir.942 | hsa.mir.943 | hsa.mir.944 | hsa.mir.95 | hsa.mir.96 | hsa.mir.98 | hsa.mir.99a | hsa.mir.99b | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | ... | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 |
| mean | 9218.938921 | 18432.504585 | 9289.250466 | 26606.604836 | 3152.699471 | 558.321269 | 1289.570177 | 24.359962 | 8687.461926 | 610.223836 | ... | 5.902975 | 0.003737 | 6.446279 | 0.061018 | 2.320737 | 3.150482 | 38.307053 | 63.746405 | 1034.572148 | 44369.112203 |
| std | 4843.796136 | 9704.187427 | 4858.691217 | 16745.347957 | 3238.003201 | 346.883205 | 763.056055 | 12.490091 | 6052.615278 | 317.854963 | ... | 8.325681 | 0.049274 | 9.541682 | 0.172214 | 6.527536 | 4.287594 | 33.791795 | 40.145314 | 1117.491608 | 32754.290751 |
| min | 1294.149164 | 2599.981125 | 1319.952907 | 1817.920354 | 148.795934 | 79.783216 | 161.181457 | 2.439034 | 653.474578 | 88.614573 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.374223 | 18.400719 | 3475.079227 |
| 25% | 5902.143848 | 11741.467528 | 5933.706564 | 14580.357100 | 1276.700850 | 330.638301 | 809.867504 | 16.441786 | 4648.822942 | 410.859815 | ... | 1.378098 | 0.000000 | 2.464140 | 0.000000 | 0.373238 | 1.201951 | 14.906921 | 39.913493 | 387.430475 | 22769.094433 |
| 50% | 8016.628565 | 16040.589880 | 8103.783439 | 23097.825936 | 2352.902327 | 481.342371 | 1101.403395 | 21.890340 | 7019.157941 | 532.277053 | ... | 3.192098 | 0.000000 | 4.127957 | 0.000000 | 1.036215 | 2.235731 | 29.634884 | 52.993693 | 710.026124 | 35594.670263 |
| 75% | 11236.887034 | 22538.594950 | 11289.595988 | 34373.185504 | 3971.192192 | 681.931022 | 1619.864372 | 29.395515 | 10926.448322 | 724.277709 | ... | 7.159431 | 0.000000 | 7.551755 | 0.000000 | 2.345941 | 4.030888 | 51.258145 | 75.993914 | 1242.434228 | 53462.034662 |
| max | 45101.697434 | 90233.655610 | 45095.490102 | 144706.427973 | 59677.212349 | 3370.036117 | 11617.011618 | 121.408006 | 80780.055188 | 3342.745045 | ... | 91.996543 | 0.909391 | 184.185656 | 1.757516 | 122.685820 | 93.402785 | 259.127121 | 399.078716 | 15689.499524 | 248074.178531 |
8 rows × 897 columns
Estabelecendo um Baseline Comparativo#
Antes de qualque modelagem vamos estabelecer um baseline, i.e., uma solução simples para o problema. A ideia de adicionarmos um baseline é para guiar o experimento, de forma a compreender se todas as técnicas que estamos usando são realmente necessárias e impactam positivamente a solução. Sem um baseline o experimento acontece de forma cega e não dá para dizer se estamos seguindo o caminho certo.
X = df.drop("class", axis=1)
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
y_train.value_counts(normalize=True)
TP 0.896435
NT 0.103565
Name: class, dtype: float64
y_test.value_counts(normalize=True)
TP 0.897233
NT 0.102767
Name: class, dtype: float64
lrc = LogisticRegression(random_state=42)
cv_list_lr_baseline = cross_val_score(
lrc,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Preformance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Preformance (bac): 0.9201 +- 0.046
Modelagem#
knn = Pipeline(
[
('mms', MinMaxScaler()),
('skb', SelectKBest(chi2, k=10)),
('knn', KNeighborsClassifier(
n_neighbors=3, # número de vizinhos
p=2, # parâmetro da medida Minkowski metric. p=2 é a distância euclideana
weights="uniform", # peso de cada exemplo.
)
)
]
)
cv_list_knn_euclid = cross_val_score(
knn,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_knn_euclid = np.mean(cv_list_knn_euclid)
std_cv_knn_euclid = np.std(cv_list_knn_euclid)
print(f"Preformance (bac): {round(mean_cv_knn_euclid, 4)} +- {round(std_cv_knn_euclid, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
Preformance (bac): 0.9703 +- 0.0377
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
knn = Pipeline(
[
('mms', MinMaxScaler()),
('skb', SelectKBest(chi2, k=10)),
('knn', KNeighborsClassifier(
n_neighbors=3, # número de vizinhos
p=1, # parâmetro da medida Minkowski metric. p=2 é a distância euclideana
weights="uniform", # peso de cada exemplo
)
)
]
)
cv_list_knn_man = cross_val_score(
knn,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_knn_man = np.mean(cv_list_knn_man)
std_cv_knn_man = np.std(cv_list_knn_man)
print(f"Preformance (bac): {round(mean_cv_knn_man, 4)} +- {round(std_cv_knn_man, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
Preformance (bac): 0.9638 +- 0.0407
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
lr = Pipeline(
[
('scaler', StandardScaler()),
('lr', LogisticRegression(
penalty="l2", # penalidade, usado para evitar overfitting
C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
fit_intercept=True, # bias ou intercepto do modelo
class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
random_state=42
)
)
])
cv_list_lr_l2 = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_l2 = np.mean(cv_list_lr_l2)
std_cv_lr_l2 = np.std(cv_list_lr_l2)
print(f"Preformance (bac): {round(mean_cv_lr_l2, 4)} +- {round(std_cv_lr_l2, 4)}")
Preformance (bac): 0.9655 +- 0.0391
lr = Pipeline(
[
('scaler', StandardScaler()),
('lr', LogisticRegression(
penalty="l1", # penalidade, usado para evitar overfitting
C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
fit_intercept=True, # bias ou intercepto do modelo
class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
solver="liblinear",
random_state=42
)
)
])
cv_list_lr_l1 = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_l1 = np.mean(cv_list_lr_l1)
std_cv_lr_l1 = np.std(cv_list_lr_l1)
print(f"Preformance (bac): {round(mean_cv_lr_l1, 4)} +- {round(std_cv_lr_l1, 4)}")
Preformance (bac): 0.9665 +- 0.0373
lr = Pipeline(
[
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('lr', LogisticRegression(
penalty="l2", # penalidade, usado para evitar overfitting
C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
fit_intercept=True, # bias ou intercepto do modelo
class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
random_state=42
)
)
])
cv_list_lr_pca = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_pca = np.mean(cv_list_lr_pca)
std_cv_lr_pca = np.std(cv_list_lr_pca)
print(f"Preformance (bac): {round(mean_cv_lr_pca, 4)} +- {round(std_cv_lr_pca, 4)}")
Preformance (bac): 0.9822 +- 0.0228
Avaliação Experimental#
# resultados da cross-validacao
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_knn_euclid, cv_list_knn_man, cv_list_lr_l2, cv_list_lr_l1, cv_list_lr_pca],
index=["baseline", "kNN-eucli", "kNN-man","LR-L2", "LR-L1", "LR-PCA"]
).T
df_result_cv
| baseline | kNN-eucli | kNN-man | LR-L2 | LR-L1 | LR-PCA | |
|---|---|---|---|---|---|---|
| 0 | 0.907233 | 1.000000 | 0.916667 | 0.990566 | 0.990566 | 0.990566 |
| 1 | 0.990566 | 0.981132 | 0.990566 | 0.888365 | 0.981132 | 0.981132 |
| 2 | 0.971698 | 0.990566 | 0.990566 | 0.990566 | 0.990566 | 0.990566 |
| 3 | 0.907233 | 0.916667 | 0.916667 | 0.916667 | 0.907233 | 0.990566 |
| 4 | 0.907233 | 1.000000 | 1.000000 | 0.990566 | 1.000000 | 1.000000 |
| 5 | 0.916667 | 0.916667 | 0.916667 | 0.916667 | 0.916667 | 0.916667 |
| 6 | 0.907233 | 0.907233 | 0.907233 | 0.981132 | 0.907233 | 0.990566 |
| 7 | 0.878931 | 0.990566 | 1.000000 | 0.990566 | 0.981132 | 0.981132 |
| 8 | 0.980769 | 1.000000 | 1.000000 | 0.990385 | 0.990385 | 0.980769 |
| 9 | 0.833333 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.907233 |
| 1 | 0 | kNN-eucli | 1.000000 |
| 2 | 0 | kNN-man | 0.916667 |
| 3 | 0 | LR-L2 | 0.990566 |
| 4 | 0 | LR-L1 | 0.990566 |
| 5 | 0 | LR-PCA | 0.990566 |
| 6 | 1 | baseline | 0.990566 |
| 7 | 1 | kNN-eucli | 0.981132 |
| 8 | 1 | kNN-man | 0.990566 |
| 9 | 1 | LR-L2 | 0.888365 |
| 10 | 1 | LR-L1 | 0.981132 |
| 11 | 1 | LR-PCA | 0.981132 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
plt.figure(figsize=(10,10))
sns.catplot(x="pipelines", y="balance
d_accuracy", kind="violin", data=df_res)
<seaborn.axisgrid.FacetGrid at 0x7f218c1d0fd0>
<Figure size 720x720 with 0 Axes>
O modelo que vamos selecionar será o Logistic Regression com PCA, pois este apresentou média/mediana competitiva e menor desvio padrão.
Por fim, vamos avaliar a performance final do nosso modelo:
# retreinar o pipeline selecionado com todos os dados de treinamento
lr = Pipeline(
[
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('lr', LogisticRegression(
penalty="l2",
C=1,
fit_intercept=True,
class_weight="balanced",
random_state=42
)
)
])
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
lr_pca_test = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(lr_pca_test, 4))
Performance: 0.972
# Confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test)
plt.show()
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test, normalize='true')
plt.show()
Referências & Links#
Prática 4: Classificação de medicamentos usando árvore de decisão#
Descrição & Objetivo#
Descrição dos Dados: Uma empresa farmacêutica desenvolveu e testou dois diferentes medicamentos para o tratamento de uma doença. Os pesquisadores perceberam que o remédio A se comportava melhor para alguns pacientes enquanto que o B foi melhor para outro grupo de pacientes. Foram coletadas as seguintes características dos pacientes: idade (Age), sexo (Sex), pressão sanguínea (BP) e nível de colesterol (Cholesterol). Você foi acionado pela equipe para construir uma solução automática para recomendar o melhor medicamento. Contudo, como se trata de um medicamento, é esperado que esta recomendação seja transparente, i.e., o paciente precisa entender exatamente o motivo de tal recomendação.
Objetivo: Construir regras claras e bem definidas para recomendar o melhor medicamento dado as características do paciente.
1. Leitura dos Dados#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
# Leitura de dados
data = pd.read_csv("drug_data.csv")
data
| Age | Sex | BP | Cholesterol | Drug | |
|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | B |
| 1 | 28 | F | NORMAL | HIGH | A |
| 2 | 61 | F | LOW | HIGH | B |
| 3 | 22 | F | NORMAL | HIGH | A |
| 4 | 49 | F | NORMAL | HIGH | B |
| ... | ... | ... | ... | ... | ... |
| 140 | 72 | M | LOW | HIGH | B |
| 141 | 46 | F | HIGH | HIGH | B |
| 142 | 52 | M | NORMAL | HIGH | A |
| 143 | 23 | M | NORMAL | NORMAL | A |
| 144 | 40 | F | LOW | NORMAL | A |
145 rows × 5 columns
2. Análise Exploratória dos Dados#
# Contar classes
data["Drug"].value_counts()
B 91
A 54
Name: Drug, dtype: int64
# Porcentagem de exemplos de cada classe
data["Drug"].value_counts(True)
B 0.627586
A 0.372414
Name: Drug, dtype: float64
# Verificar se tem NaN
data.isna().sum()
Age 0
Sex 0
BP 0
Cholesterol 0
Drug 0
dtype: int64
# Descrever dados categóricos
data.describe(include=[object])
| Sex | BP | Cholesterol | Drug | |
|---|---|---|---|---|
| count | 145 | 145 | 145 | 145 |
| unique | 2 | 3 | 2 | 2 |
| top | F | NORMAL | NORMAL | B |
| freq | 74 | 59 | 78 | 91 |
# Descrever dados numéricos
data.describe(include=[np.number])
| Age | |
|---|---|
| count | 145.000000 |
| mean | 43.848276 |
| std | 16.755319 |
| min | 15.000000 |
| 25% | 30.000000 |
| 50% | 43.000000 |
| 75% | 58.000000 |
| max | 74.000000 |
3. Modelagem & Avaliação#
3.1 Baseline#
# Separar os dados em treinamento e teste
categorical_features = ["Sex", "BP", "Cholesterol"]
numerical_features = ["Age"]
X = data.drop(columns=["Drug"])
y = data["Drug"].astype('category').cat.codes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
lrc = LogisticRegression(random_state=42)
cv_list_lr_baseline = cross_val_score(
lrc,
X_train[numerical_features],
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.5 +- 0.0
3.2 Tratando Dados Categóricos#
Embora os algoritmos de árvore de decisão tenham sido inicialmente criados para lidar com dados categóricos, os algoritmos do sklearn só trabalham com dados numéricos. Portanto, precisamos transformar os atributos categóricos para numéricos usando, por exemplo, o OneHotEncoder ou OrdinalEncoder.
# Esse comando vai transformar as categorias em números
oe = OrdinalEncoder()
oe.fit(data[["Sex", "BP"]])
oe.transform(data[["Sex", "BP"]])[:10]
array([[0., 0.],
[0., 2.],
[0., 1.],
[0., 2.],
[0., 2.],
[1., 2.],
[1., 1.],
[0., 0.],
[1., 1.],
[0., 1.]])
# Podemos transformar as features de forma diferente
# Para isso podemos usar o ColumnTransformer
tr = ColumnTransformer(
transformers=[
("cat-enc", OrdinalEncoder(), categorical_features),
# ("min-max", MinMaxScaler(), numerical_features)
]
)
tr.fit_transform(X_train)[:10]
array([[0., 2., 1.],
[1., 1., 1.],
[0., 1., 1.],
[0., 0., 1.],
[1., 1., 1.],
[1., 2., 1.],
[0., 1., 1.],
[0., 0., 0.],
[0., 2., 0.],
[0., 0., 0.]])
# Podemos transformar as features de forma diferente
# Para isso podemos usar o ColumnTransformer
tr = ColumnTransformer(
transformers=[
("cat-enc", OrdinalEncoder(), categorical_features),
("min-max", MinMaxScaler(), numerical_features)
]
)
tr.fit_transform(X_train)[:10]
array([[0. , 2. , 1. , 0.08474576],
[1. , 1. , 1. , 0.50847458],
[0. , 1. , 1. , 0.61016949],
[0. , 0. , 1. , 0.06779661],
[1. , 1. , 1. , 0.57627119],
[1. , 2. , 1. , 0.13559322],
[0. , 1. , 1. , 0.08474576],
[0. , 0. , 0. , 0.72881356],
[0. , 2. , 0. , 0.42372881],
[0. , 0. , 0. , 0.45762712]])
# Podemos transformar as features de forma diferente
# Para isso podemos usar o ColumnTransformer
tr = ColumnTransformer(
transformers=[
("cat-enc", OrdinalEncoder(), categorical_features),
# ("min-max", MinMaxScaler(), numerical_features)
],
remainder='passthrough' # passa para frente demais features
)
tr.fit_transform(X_train)[:10]
array([[ 0., 2., 1., 20.],
[ 1., 1., 1., 45.],
[ 0., 1., 1., 51.],
[ 0., 0., 1., 19.],
[ 1., 1., 1., 49.],
[ 1., 2., 1., 23.],
[ 0., 1., 1., 20.],
[ 0., 0., 0., 58.],
[ 0., 2., 0., 40.],
[ 0., 0., 0., 42.]])
3.2 Modelando uma árvore de decisão#
Para isso, podemos usar a classe DecisionTreeClassifier, que gera modelos do algoritmo DT.
Observe que é necessário configurar os parâmetros que controlam o tamanho da árvore para evitar overfitting. Além disso, perdemos o entendimento do modelo ao criar árvores muito compridas. Veja a nota na documentação do algoritmo:
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
Sem limitar tamanho da árvore#
categorical_features = ["Sex", "BP", "Cholesterol"]
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough' # passa para frente demais features
)
dt = DecisionTreeClassifier(
criterion="gini", # critério para medir a qualidade da divisão (split)
class_weight="balanced", # atribuir o mesmo peso para as duas classes
random_state=42
)
pipe1 = Pipeline([
('preprocessing-1', ct),
('model', dt)
])
cv_list_dt = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_dt = np.mean(cv_list_dt)
std_cv_dt = np.std(cv_list_dt)
print(f"Performance (bac): {round(mean_cv_dt, 4)} +- {round(std_cv_dt, 4)}")
Performance (bac): 0.6643 +- 0.1602
pipe1.fit(X_train, y_train)
model = pipe1["model"]
X_train_prep = pipe1["preprocessing-1"].transform(X_train)
feature_names = pipe1["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe1["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]
plt.figure(figsize=(15,10))
plot_tree(
model.fit(X_train_prep, y_train),
feature_names=feature_names,
class_names=class_names,
filled=True,
proportion=True
)
plt.show(True)
Limitando o tamanho da árvore#
categorical_features = ["Sex", "BP", "Cholesterol"]
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough' # passa para frente demais features
)
dt = DecisionTreeClassifier(
criterion="gini", # critério para medir a qualidade da divisão (split)
max_depth=3, # profundidade máxima da árvore
min_samples_leaf=0.01, # número mínimo de folhas em um nó - aqui no caso em porcentagem = 1%
class_weight="balanced", # atribuir o mesmo peso para as duas classes
random_state=42
)
pipe2 = Pipeline([
('preprocessing-1', ct),
('model', dt)
])
cv_list_dt2 = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_dt2 = np.mean(cv_list_dt2)
std_cv_dt2 = np.std(cv_list_dt2)
print(f"Performance (bac): {round(mean_cv_dt2, 4)} +- {round(std_cv_dt2, 4)}")
Performance (bac): 0.7845 +- 0.1012
pipe2.fit(X_train, y_train)
model = pipe2["model"]
X_train_prep = pipe2["preprocessing-1"].transform(X_train)
feature_names = pipe2["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe2["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]
plt.figure(figsize=(15,10))
plot_tree(
model.fit(X_train_prep, y_train),
feature_names=feature_names,
class_names=class_names,
filled=True,
proportion=True
)
plt.show(True)
# retreinar o pipeline selecionado com todos os dados de treinamento
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
lr_pca_test = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(lr_pca_test, 4))
Performance: 0.8036
4. DTreeViz (Decision Tree Visualization)#
# descomente o comando abaixo para instalar a dtreeviz
!pip install dtreeviz
Looking in indexes: https://SefazFeed:****@ads.intra.fazenda.sp.gov.br/tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/
WARNING: Ignoring invalid distribution -cipy (c:\users\mlfernandez\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\mlfernandez\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -tatsmodels (c:\users\mlfernandez\anaconda3\lib\site-packages)
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x0000023AF9BA5130>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')': /tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/dtreeviz/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x0000023AF9BA5400>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')': /tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/dtreeviz/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x0000023AF9BA55B0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')': /tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/dtreeviz/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x0000023AF9BA5760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')': /tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/dtreeviz/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x0000023AF9BA5910>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')': /tfs/PRODUTOS/_packaging/SefazFeed/pypi/simple/dtreeviz/
ERROR: Could not find a version that satisfies the requirement dtreeviz (from versions: none)
ERROR: No matching distribution found for dtreeviz
from sklearn.datasets import load_iris
import dtreeviz
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
classifier = DecisionTreeClassifier(max_depth=3)
iris = load_iris()
classifier.fit(iris.data, iris.target)
viz = dtreeviz.model(classifier,
iris.data,
iris.target,
target_name='variety',
feature_names=iris.feature_names,
class_names=["setosa", "versicolor", "virginica"]
)
viz.view(scale=1.4)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8884\1144491373.py in <module>
1 from sklearn.datasets import load_iris
----> 2 import dtreeviz
3 import logging
4
5 logging.getLogger('matplotlib.font_manager').disabled = True
ModuleNotFoundError: No module named 'dtreeviz'
viz.view(x=iris.data[100], scale=1.4)
Referências & Links#
Prática 5a: Predição de abandono de produto de crédito usando o algoritmo Support Vector Machine#
Descrição & Objetivo#
Descrição dos Dados: Uma empresa possui um produto de cartão de crédito muito conhecido no mercado. Contudo, dada a alta competitividade, a empresa percebeu que seus clientes começaram a abandonar seu produto. A empresa contratou você para modelar quais clientes provavelmente irão abandonar o produto, para que ela interfira proativamente evitando a perda dos clientes. Os dados disponibilizados possuem diversas características dos clientes (e.g., sexo, idade, estado civil, escolaridade) e do uso do produto de crédito (e.g., limite do cartão, tipo do cartão). A coluna Attrition_Flag indica se o cliente abandonou ou não o produto.
Objetivo: Modelar o problema com SVM para classificar quais clientes provavelmente irão abandonar o produto.
1. Leitura dos Dados#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Leitura de dados
data = pd.read_csv("credit_card_churn_data.csv")
target = 'Attrition_Flag'
categorical_features = ['Gender', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category'
]
numerical_features = ['Dependent_count', 'Customer_Age', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'
]
data
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 45 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | M | High School | Married | $60K - $80K | Blue | Existing Customer |
| 1 | 5 | 49 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | F | Graduate | Single | Less than $40K | Blue | Existing Customer |
| 2 | 3 | 51 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | M | Graduate | Married | $80K - $120K | Blue | Existing Customer |
| 3 | 4 | 40 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | F | High School | Unknown | Less than $40K | Blue | Existing Customer |
| 4 | 3 | 40 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | M | Uneducated | Married | $60K - $80K | Blue | Existing Customer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 2 | 50 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | M | Graduate | Single | $40K - $60K | Blue | Existing Customer |
| 10123 | 2 | 41 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | M | Unknown | Divorced | $40K - $60K | Blue | Attrited Customer |
| 10124 | 1 | 44 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 | F | High School | Married | Less than $40K | Blue | Attrited Customer |
| 10125 | 2 | 30 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 | M | Graduate | Unknown | $40K - $60K | Blue | Attrited Customer |
| 10126 | 2 | 43 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 | F | Graduate | Married | Less than $40K | Silver | Attrited Customer |
10127 rows × 20 columns
2. Análise Exploratória dos Dados#
# Número de exemplos e features
data.shape
(10127, 20)
# Contar classes
data[target].value_counts()
Attrition_Flag
Existing Customer 8500
Attrited Customer 1627
Name: count, dtype: int64
# Porcentagem de exemplos de cada classe
data[target].value_counts(normalize=True)
Attrition_Flag
Existing Customer 0.83934
Attrited Customer 0.16066
Name: proportion, dtype: float64
# Verificar se tem NaN
data.isna().sum()
Dependent_count 0
Customer_Age 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
Gender 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Attrition_Flag 0
dtype: int64
# Descrever dados categóricos
data.describe(include=[object])
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 7 | 4 | 6 | 4 | 2 |
| top | F | Graduate | Married | Less than $40K | Blue | Existing Customer |
| freq | 5358 | 3128 | 4687 | 3561 | 9436 | 8500 |
# Descrever dados numéricos
data.describe(include=[np.number])
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 2.346203 | 46.325960 | 35.928409 | 3.812580 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 1.298908 | 8.016814 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.776650 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.472570 | 0.238086 | 0.275691 |
| min | 0.000000 | 26.000000 | 13.000000 | 1.000000 | 0.000000 | 0.000000 | 1438.300000 | 0.000000 | 3.000000 | 0.000000 | 510.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 41.000000 | 31.000000 | 3.000000 | 2.000000 | 2.000000 | 2555.000000 | 359.000000 | 1324.500000 | 0.631000 | 2155.500000 | 45.000000 | 0.582000 | 0.023000 |
| 50% | 2.000000 | 46.000000 | 36.000000 | 4.000000 | 2.000000 | 2.000000 | 4549.000000 | 1276.000000 | 3474.000000 | 0.736000 | 3899.000000 | 67.000000 | 0.702000 | 0.176000 |
| 75% | 3.000000 | 52.000000 | 40.000000 | 5.000000 | 3.000000 | 3.000000 | 11067.500000 | 1784.000000 | 9859.000000 | 0.859000 | 4741.000000 | 81.000000 | 0.818000 | 0.503000 |
| max | 5.000000 | 73.000000 | 56.000000 | 6.000000 | 6.000000 | 6.000000 | 34516.000000 | 2517.000000 | 34516.000000 | 3.397000 | 18484.000000 | 139.000000 | 3.714000 | 0.999000 |
3. Modelagem & Avaliação#
3.1 Baseline#
# Separar os dados em treinamento e teste
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# zscore para mudar escala os dados
scaler = StandardScaler()
# modelo baseline
lrc = LogisticRegression(random_state=42)
# pipeline de machine learning
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lrc)
])
# cross validação da solução
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227
3.2 Modelando com SVM#
n_samples = 1500
X_, Y_ = datasets.make_circles(n_samples=n_samples, factor=0.5, noise=0.05)
Y_ = ["c1" if i == 0 else "c2" for i in Y_]
aux_data = pd.DataFrame({"x": X_[:, 0], "y": X_[:, 1], "color": Y_})
fig = px.scatter(aux_data, x="x", y="y", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()
Mapeando os dados em uma função polinomial de segundo grau:
\( k(x, y) = \begin{pmatrix} x^2\\ \sqrt{2}xy\\ y^2\\ \end{pmatrix} \)
def k(x, y):
return x**2, np.sqrt(2)*x*y, y**2
x, y, z = k(X_[:,0], X_[:, 1])
aux_data = pd.DataFrame({"x": x, "y": y, "z": z, "color": Y_})
fig = px.scatter_3d(aux_data, x="x", y="y", z="z", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()
Kernel Linear#
O kernel linear é definido no sklearn como:
\(\left \langle x, x' \right \rangle\)
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo SVM
svc = SVC(
C=1.0, # C é o hiperparâmetro de regularização.
# Ele controla o trade-off entre o limite de decisão suave e a
# classificação correta dos pontos de treinamento.
# O aumento dos valores de C pode levar a um ajuste excessivo dos dados de treinamento.
kernel="linear", # tipo do kernel
class_weight="balanced", # atribuir o mesmo peso para as duas classes
max_iter=100000, # número de iterações do otimizador
random_state=42
)
# pipeline de machine learning
pipe1 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross validação da solução
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.855 +- 0.0119
Kernel Polinomial#
O kernel polinomial é definido no sklearn como:
\(\left ( \gamma \left \langle x, x' \right \rangle + r \right )^d\)
onde, \(d\) é especificado pelo hyperparametro degree, \(r\) por coef0 e \(\gamma\) por gamma.
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo SVM
svc = SVC(
C=1.0, # C é o hiperparâmetro de regularização.
# Ele controla o trade-off entre o limite de decisão suave e a
# classificação correta dos pontos de treinamento.
# O aumento dos valores de C pode levar a um ajuste excessivo dos dados de treinamento.
kernel="poly", # tipo do kernel
degree=3, # grau do kernel polinomial
coef0=1, # termo independente da função kernel
gamma="scale", # coeficiente gamma da função kernel
# scale = 1 / (n_features * X.var())
# auto = 1 / n_features
# podemos usar float > 0
class_weight="balanced", # atribuir o mesmo peso para as duas classes
max_iter=100000, # número de iterações do otimizador
random_state=42
)
# pipeline de machine learning
pipe2 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross validação da solução
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (bac): 0.8692 +- 0.0162
Kernel RBF#
O kernel rbf é definido no sklearn como:
\(exp(-\gamma \left \| x - x' \right\|^2)\)
onde, \(\gamma\) é especificado pelo hyperparametro gamma e deve ser \(\geq 0\).
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo SVM
svc = SVC(
C=1.0, # C é o hiperparâmetro de regularização.
# Ele controla o trade-off entre o limite de decisão suave e a
# classificação correta dos pontos de treinamento.
# O aumento dos valores de C pode levar a um ajuste excessivo dos dados de treinamento.
kernel="rbf", # tipo do kernel
gamma="scale", # coeficiente gamma da função kernel
# scale = 1 / (n_features * X.var())
# auto = 1 / n_features
# podemos usar float > 0
class_weight="balanced", # atribuir o mesmo peso para as duas classes
max_iter=100000, # número de iterações do otimizador
random_state=42
)
# pipeline de machine learning
pipe3 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross validação da solução
cv_list_pipe3_baseline = cross_val_score(
pipe3,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)
print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.857 +- 0.0176
Avaliação Experimental#
# resultados da cross-validação
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline],
index=["baseline","SVM-linear", "SVM-poly", "SMV-rbf"]
).T
df_result_cv
| baseline | SVM-linear | SVM-poly | SMV-rbf | |
|---|---|---|---|---|
| 0 | 0.761374 | 0.846160 | 0.842349 | 0.824149 |
| 1 | 0.765576 | 0.847759 | 0.867743 | 0.861861 |
| 2 | 0.768097 | 0.881166 | 0.889017 | 0.884159 |
| 3 | 0.777053 | 0.856347 | 0.881925 | 0.866431 |
| 4 | 0.777053 | 0.854666 | 0.861020 | 0.839275 |
| 5 | 0.782279 | 0.855587 | 0.862045 | 0.859052 |
| 6 | 0.836776 | 0.867168 | 0.891722 | 0.862413 |
| 7 | 0.777893 | 0.860077 | 0.864934 | 0.869505 |
| 8 | 0.815479 | 0.839403 | 0.847289 | 0.834308 |
| 9 | 0.772090 | 0.841585 | 0.884272 | 0.868476 |
# linearizar matriz
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.761374 |
| 1 | 0 | SVM-linear | 0.846160 |
| 2 | 0 | SVM-poly | 0.842349 |
| 3 | 0 | SMV-rbf | 0.824149 |
| 4 | 1 | baseline | 0.765576 |
| 5 | 1 | SVM-linear | 0.847759 |
| 6 | 1 | SVM-poly | 0.867743 |
| 7 | 1 | SMV-rbf | 0.861861 |
| 8 | 2 | baseline | 0.768097 |
| 9 | 2 | SVM-linear | 0.881166 |
| 10 | 2 | SVM-poly | 0.889017 |
| 11 | 2 | SMV-rbf | 0.884159 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
# retreinar o pipeline selecionado com todos os dados de treinamento
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(bac, 4))
Performance: 0.8633
Referências & Links#
© ealcobaca (2022)
MBA em Ciência de Dados#
Universidade de São Paulo, São Carlos, Brasil#
Disciplina: Aprendizado de Máquina
Professor Reponsavel: Prof. Dr. André Carlos Ponce de Leon Ferreira de Carvalho 📧 andre@icmc.usp.br 🌐 linkedin 🔗 site
Prática 5b: Predição de abandono de produto de crédito usando o algoritmo Multilayer Perceptron#
Autor: Edesio Alcobaça 📧 edesio@usp.br 🌐 linkedin 🔗 site
Descrição & Objetivo#
Descrição dos Dados: Uma empresa possui um produto de cartão de crédito muito conhecido no mercado. Contudo, dada a alta competitividade, a empresa percebeu que seus clientes começaram a abandonar seu produto. A empresa contratou você para modelar quais clientes provavelmente irão abandonar o produto, para que ela interfira proativamente evitando a perda dos clientes. Os dados disponibilizados possuem diversas características dos clientes (e.g., sexo, idade, estado civil, escolaridade) e do uso do produto de crédito (e.g., limite do cartão, tipo do cartão). A coluna Attrition_Flag indica se o cliente abandonou ou não o produto.
Objetivo: Modelar o problema com MLP de forma a retornar uma lista ordenada de clientes, do mais ao menos provável a abandonar o produto.
1. Leitura dos Dados#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
# Leitura de dados
data = pd.read_csv("credit_card_churn_data.csv")
target = 'Attrition_Flag'
categorical_features = ['Gender', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category'
]
numerical_features = ['Dependent_count', 'Customer_Age', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'
]
data
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 45 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | M | High School | Married | $60K - $80K | Blue | Existing Customer |
| 1 | 5 | 49 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | F | Graduate | Single | Less than $40K | Blue | Existing Customer |
| 2 | 3 | 51 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | M | Graduate | Married | $80K - $120K | Blue | Existing Customer |
| 3 | 4 | 40 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | F | High School | Unknown | Less than $40K | Blue | Existing Customer |
| 4 | 3 | 40 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | M | Uneducated | Married | $60K - $80K | Blue | Existing Customer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 2 | 50 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | M | Graduate | Single | $40K - $60K | Blue | Existing Customer |
| 10123 | 2 | 41 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | M | Unknown | Divorced | $40K - $60K | Blue | Attrited Customer |
| 10124 | 1 | 44 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 | F | High School | Married | Less than $40K | Blue | Attrited Customer |
| 10125 | 2 | 30 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 | M | Graduate | Unknown | $40K - $60K | Blue | Attrited Customer |
| 10126 | 2 | 43 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 | F | Graduate | Married | Less than $40K | Silver | Attrited Customer |
10127 rows × 20 columns
2. Análise Exploratória dos Dados#
# Número de exemplos e features
data.shape
(10127, 20)
# Contar classes
data[target].value_counts()
Existing Customer 8500
Attrited Customer 1627
Name: Attrition_Flag, dtype: int64
# Porcentagem de exemplos de cada classe
data[target].value_counts(normalize=True)
Existing Customer 0.83934
Attrited Customer 0.16066
Name: Attrition_Flag, dtype: float64
# Verificar se tem NaN
data.isna().sum()
Dependent_count 0
Customer_Age 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
Gender 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Attrition_Flag 0
dtype: int64
# Descrever dados categóricos
data.describe(include=[object])
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 7 | 4 | 6 | 4 | 2 |
| top | F | Graduate | Married | Less than $40K | Blue | Existing Customer |
| freq | 5358 | 3128 | 4687 | 3561 | 9436 | 8500 |
# Descrever dados numéricos
data.describe(include=[np.number])
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 2.346203 | 46.325960 | 35.928409 | 3.812580 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 1.298908 | 8.016814 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.776650 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.472570 | 0.238086 | 0.275691 |
| min | 0.000000 | 26.000000 | 13.000000 | 1.000000 | 0.000000 | 0.000000 | 1438.300000 | 0.000000 | 3.000000 | 0.000000 | 510.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 41.000000 | 31.000000 | 3.000000 | 2.000000 | 2.000000 | 2555.000000 | 359.000000 | 1324.500000 | 0.631000 | 2155.500000 | 45.000000 | 0.582000 | 0.023000 |
| 50% | 2.000000 | 46.000000 | 36.000000 | 4.000000 | 2.000000 | 2.000000 | 4549.000000 | 1276.000000 | 3474.000000 | 0.736000 | 3899.000000 | 67.000000 | 0.702000 | 0.176000 |
| 75% | 3.000000 | 52.000000 | 40.000000 | 5.000000 | 3.000000 | 3.000000 | 11067.500000 | 1784.000000 | 9859.000000 | 0.859000 | 4741.000000 | 81.000000 | 0.818000 | 0.503000 |
| max | 5.000000 | 73.000000 | 56.000000 | 6.000000 | 6.000000 | 6.000000 | 34516.000000 | 2517.000000 | 34516.000000 | 3.397000 | 18484.000000 | 139.000000 | 3.714000 | 0.999000 |
3. Modelagem & Avaliação#
3.1 Baseline#
# Separar os dados em treinamento e teste
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# zscore para mudar escala os dados
scaler = StandardScaler()
# modelo baseline
lrc = LogisticRegression(random_state=42)
# pipeline de machine learning
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lrc)
])
# cross validação da solução
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227
3.2 Modelando com MLP#

1 camada oculta –> (10, )#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo baseline
mlp = MLPClassifier(
hidden_layer_sizes=(10,), # numero de neurônios
activation = "relu", # função de ativação para a camada oculta
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # Força do termo regularizador L2.
solver="adam", # otimizador
batch_size="auto", # Tamanho do mini-batch para otimizadores estocásticos
learning_rate_init= 0.01, # taxa de aprendizado inicial. Ele controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptative
early_stopping=False, # Se a interrupção antecipada deve ser usada para encerrar
# o treinamento quando a performance de validação não estiver melhorando.
max_iter=200, # número de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe1 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.8612 +- 0.0277
2 camada oculta –> (10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo baseline
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10), # número de neurônios
activation = "relu", # função de ativação para a camada oculta
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # Força do termo regularizador L2.
solver="adam", # otimizador
batch_size="auto", # Tamanho do mini-batch para otimizadores estocásticos
learning_rate_init= 0.01, # taxa de aprendizado inicial. Ele controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptative
early_stopping=False, # Se a interrupção antecipada deve ser usada para encerrar
# o treinamento quando a performance de validação não estiver melhorando.
max_iter=200, # número de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe2 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
ConvergenceWarning,
Performance (bac): 0.8679 +- 0.0367
3 camada oculta –> (10, 10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar escala os dados
scaler = MinMaxScaler()
# modelo baseline
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10, 10), # número de neurônios
activation = "relu", # função de ativação para a camada oculta
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # Força do termo regularizador L2.
solver="adam", # otimizador
batch_size="auto", # Tamanho do mini-batch para otimizadores estocásticos
learning_rate_init= 0.01, # taxa de aprendizado inicial. Ele controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptative
early_stopping=False, # Se a interrupção antecipada deve ser usada para encerrar
# o treinamento quando a performance de validação não estiver melhorando.
max_iter=200, # número de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe3 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe3_baseline = cross_val_score(
pipe3,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)
print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.9082 +- 0.0129
4 camada oculta –> (10, 10, 10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# zscore para mudar escala os dados
scaler = MinMaxScaler()
# modelo baseline
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10, 10, 10), # número de neurônios
activation = "relu", # função de ativação para a camada oculta
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # Força do termo regularizador L2.
solver="adam", # otimizador
batch_size="auto", # Tamanho do mini-batch para otimizadores estocásticos
learning_rate_init= 0.01, # taxa de aprendizado inicial. Ele controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptative
early_stopping=False, # Se a interrupção antecipada deve ser usada para encerrar
# o treinamento quando a performance de validação não estiver melhorando.
max_iter=200, # número de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe4 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe4_baseline = cross_val_score(
pipe4,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe4_baseline = np.mean(cv_list_pipe4_baseline)
std_cv_pipe4_baseline = np.std(cv_list_pipe4_baseline)
print(f"Performance (bac): {round(mean_cv_pipe4_baseline, 4)} +- {round(std_cv_pipe4_baseline, 4)}")
Performance (bac): 0.8684 +- 0.0163
Avaliação Experimental#
# resultados da cross-validação
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline, cv_list_pipe4_baseline],
index=["baseline","MLP(10,)", "MLP(10, 10)", "MLP(10, 10, 10)", "MLP(10, 10, 10, 10)"]
).T
df_result_cv
| baseline | MLP(10,) | MLP(10, 10) | MLP(10, 10, 10) | MLP(10, 10, 10, 10) | |
|---|---|---|---|---|---|
| 0 | 0.761374 | 0.822778 | 0.853008 | 0.899676 | 0.872232 |
| 1 | 0.765576 | 0.830341 | 0.863092 | 0.911337 | 0.833886 |
| 2 | 0.768097 | 0.863092 | 0.853111 | 0.880267 | 0.866453 |
| 3 | 0.777053 | 0.908448 | 0.892769 | 0.905558 | 0.862067 |
| 4 | 0.777053 | 0.872520 | 0.879979 | 0.923286 | 0.896970 |
| 5 | 0.782279 | 0.834911 | 0.816342 | 0.901356 | 0.883341 |
| 6 | 0.836776 | 0.887542 | 0.889776 | 0.923470 | 0.868974 |
| 7 | 0.777893 | 0.849093 | 0.800295 | 0.903590 | 0.857865 |
| 8 | 0.815479 | 0.895800 | 0.919059 | 0.908426 | 0.882111 |
| 9 | 0.772090 | 0.847929 | 0.912003 | 0.924831 | 0.860140 |
# linearizar matriz
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.761374 |
| 1 | 0 | MLP(10,) | 0.822778 |
| 2 | 0 | MLP(10, 10) | 0.853008 |
| 3 | 0 | MLP(10, 10, 10) | 0.899676 |
| 4 | 0 | MLP(10, 10, 10, 10) | 0.872232 |
| 5 | 1 | baseline | 0.765576 |
| 6 | 1 | MLP(10,) | 0.830341 |
| 7 | 1 | MLP(10, 10) | 0.863092 |
| 8 | 1 | MLP(10, 10, 10) | 0.911337 |
| 9 | 1 | MLP(10, 10, 10, 10) | 0.833886 |
| 10 | 2 | baseline | 0.768097 |
| 11 | 2 | MLP(10,) | 0.863092 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
ROC curve#
from sklearn.metrics import DetCurveDisplay, RocCurveDisplay
pipelines = {
"baseline": pipeb,
"MLP(10,)": pipe1,
"MLP(10, 10)": pipe2,
"MLP(10, 10, 10)": pipe3,
"MLP(10, 10, 10, 10)": pipe4,
}
fig, [ax_roc, ax_det] = plt.subplots(1, 2, figsize=(11, 5))
for name, clf in pipelines.items():
clf.fit(X_train, y_train)
RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_roc, name=name)
DetCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_det, name=name)
ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
ax_det.set_title("Detection Error Tradeoff (DET) curves")
ax_roc.grid(linestyle="--")
ax_det.grid(linestyle="--")
plt.legend()
plt.show()
Final pipeline#
# retreinar o pipeline selecionado com todos os dados de treinamento
pipe3.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(bac, 4))
Performance: 0.8441
# Em vez de usar as predições podemos usar a probabilidade
# Desta maneira conseguiremos construir uma lista, ordenando dos mais aos menos prováveis de abandonar o produto
y_pred_prob = pipe3.predict_proba(X_test)
y_pred_prob[:, 0]
array([4.96625438e-05, 1.80525853e-03, 9.37936844e-01, ...,
1.35154815e-02, 7.06941771e-07, 6.21152944e-04])
Referências & Links#
Prática 6: Usando Algoritmos de Ensemble em um Problema de Regressão#
Descrição & Objetivo#
Descrição dos Dados: Consumidores fanáticos por abacate decidiram criar uma planilha com o preço médio anual em diversas regiões durante os anos de 2015 a 2018. Os seguintes campos foram anotados:
AveragePrice - preço médio de um abacate
type - tipo do abacate: conventional or organic
year - ano
Region - cidade ou região onde a observação foi feita
Total Volume - Número médio de abacates vendidos diariamente
4046 - Número médio diário de abacates vendidos com PLU 4046
4225 - Número médio diário de abacates vendidos com PLU 4225
4770 - Número médio diário de abacates vendidos com PLU 4770
Para entender mais sobre o PLU veja esse site.
Objetivo: Modelar o problema com Ensembles de forma a prever o preço médio de um abacate.
1. Imports#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
2. Leitura dos Dados#
# Leitura de dados
data = pd.read_csv("avocado_grouped_data.csv")
target = 'AveragePrice'
categorical_features = ["region", "type"]
numerical_features = ["year", "Total Volume", "4046", "4225", "4770"]
data
| year | type | region | AveragePrice | Total Volume | 4046 | 4225 | 4770 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | conventional | Albany | 1.171923 | 7.620873e+04 | 1037.874615 | 61764.253654 | 668.795000 |
| 1 | 2015 | conventional | Atlanta | 1.052308 | 4.403464e+05 | 347741.840385 | 35386.637308 | 757.858077 |
| 2 | 2015 | conventional | BaltimoreWashington | 1.168077 | 7.681415e+05 | 56546.030769 | 487421.365385 | 45104.819423 |
| 3 | 2015 | conventional | Boise | 1.054038 | 7.088575e+04 | 45940.442500 | 10164.187115 | 5309.087692 |
| 4 | 2015 | conventional | Boston | 1.144038 | 5.237806e+05 | 4685.945192 | 409901.282692 | 1607.198846 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 427 | 2018 | organic | Syracuse | 1.242500 | 5.688144e+03 | 143.621667 | 101.041667 | 0.000000 |
| 428 | 2018 | organic | Tampa | 1.452500 | 8.415949e+03 | 125.270833 | 610.078333 | 0.000000 |
| 429 | 2018 | organic | TotalUS | 1.554167 | 1.510488e+06 | 137640.345000 | 360420.295000 | 1407.105833 |
| 430 | 2018 | organic | West | 1.613333 | 2.549791e+05 | 28676.619167 | 53221.085833 | 191.065833 |
| 431 | 2018 | organic | WestTexNewMexico | 1.645000 | 1.676254e+04 | 1918.260833 | 2275.085000 | 140.700833 |
432 rows × 8 columns
2. Análise Exploratória de Dados#
# Número de exemplos e features
data.shape
(432, 8)
# target distribution
data[target].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d2336d4d0>
# estatisticas da target
data[target].describe()
count 432.000000
mean 1.394253
std 0.320811
min 0.659808
25% 1.145817
50% 1.385064
75% 1.630096
max 2.403208
Name: AveragePrice, dtype: float64
# Verificar se tem NaN
data.isna().sum()
year 0
type 0
region 0
AveragePrice 0
Total Volume 0
4046 0
4225 0
4770 0
dtype: int64
# Descrever dados categóricos
data.describe(include=[object])
| type | region | |
|---|---|---|
| count | 432 | 432 |
| unique | 2 | 54 |
| top | conventional | Albany |
| freq | 216 | 8 |
# Descrever dados numéricos
data.describe(include=[np.number])
| year | AveragePrice | Total Volume | 4046 | 4225 | 4770 | |
|---|---|---|---|---|---|---|
| count | 432.00000 | 432.000000 | 4.320000e+02 | 4.320000e+02 | 4.320000e+02 | 4.320000e+02 |
| mean | 2016.50000 | 1.394253 | 8.920706e+05 | 3.049745e+05 | 2.989822e+05 | 2.188050e+04 |
| std | 1.11933 | 0.320811 | 3.582444e+06 | 1.295333e+06 | 1.194733e+06 | 9.576609e+04 |
| min | 2015.00000 | 0.659808 | 1.289274e+03 | 2.679057e+00 | 6.390962e+00 | 0.000000e+00 |
| 25% | 2015.75000 | 1.145817 | 1.230606e+04 | 8.947879e+02 | 3.480897e+03 | 9.158173e-01 |
| 50% | 2016.50000 | 1.385064 | 1.180031e+05 | 1.063797e+04 | 3.042860e+04 | 3.231658e+02 |
| 75% | 2017.25000 | 1.630096 | 4.774981e+05 | 1.167135e+05 | 1.612054e+05 | 7.488512e+03 |
| max | 2018.00000 | 2.403208 | 4.212553e+07 | 1.467107e+07 | 1.243870e+07 | 1.153793e+06 |
3. Modelagem & Avaliação com Ensembles#
3.1 Baseline#
# Separar os dados em treinamento e teste
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough'
)
# zscore para mudar escala os dados
scaler = StandardScaler()
# modelo baseline
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# pipeline de machine learning
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lr)
])
# cross validação da solução
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (r2): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (r2): 0.5537 +- 0.1066
3.2 Modelando com Random Forest e Gradient Boosting#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# modelo RandomForestRegressor
rf = RandomForestRegressor(
n_estimators = 100,
criterion="squared_error",
random_state=0
)
# pipeline de machine learning
pipe1 = Pipeline([
('column-transformer', ct),
('model', rf)
])
# cross validação da solução
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (R2): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (R2): 0.7605 +- 0.0821
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# modelo GradientBoostingRegressor
rf = GradientBoostingRegressor(
n_estimators = 100,
loss="squared_error",
learning_rate=0.1,
random_state=0
)
# pipeline de machine learning
pipe2 = Pipeline([
('column-transformer', ct),
('model', rf)
])
# cross validação da solução
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (R2): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (R2): 0.7936 +- 0.0495
Avaliação Experimental#
# resultados da cross-validação
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline],
index=["baseline","RandomForestRegressor", "GradientBoostingRegressor"]
).T
df_result_cv
| baseline | RandomForestRegressor | GradientBoostingRegressor | |
|---|---|---|---|
| 0 | 0.538925 | 0.770106 | 0.792857 |
| 1 | 0.463798 | 0.557949 | 0.676694 |
| 2 | 0.630720 | 0.783451 | 0.820644 |
| 3 | 0.639834 | 0.730605 | 0.807127 |
| 4 | 0.402981 | 0.711545 | 0.817230 |
| 5 | 0.444394 | 0.778620 | 0.759449 |
| 6 | 0.543634 | 0.806082 | 0.770342 |
| 7 | 0.456912 | 0.746108 | 0.793083 |
| 8 | 0.720543 | 0.859275 | 0.822751 |
| 9 | 0.695542 | 0.860805 | 0.876062 |
# linearizar matriz
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.538925 |
| 1 | 0 | RandomForestRegressor | 0.770106 |
| 2 | 0 | GradientBoostingRegressor | 0.792857 |
| 3 | 1 | baseline | 0.463798 |
| 4 | 1 | RandomForestRegressor | 0.557949 |
| 5 | 1 | GradientBoostingRegressor | 0.676694 |
| 6 | 2 | baseline | 0.630720 |
| 7 | 2 | RandomForestRegressor | 0.783451 |
| 8 | 2 | GradientBoostingRegressor | 0.820644 |
| 9 | 3 | baseline | 0.639834 |
| 10 | 3 | RandomForestRegressor | 0.730605 |
| 11 | 3 | GradientBoostingRegressor | 0.807127 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
Final pipeline#
# retreinar o pipeline selecionado com todos os dados de treinamento
from sklearn.metrics import r2_score
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
bac = r2_score(y_test, y_pred)
print("R2: ", round(bac, 4))
R2: 0.8158
Visualização do Erro#
from yellowbrick.regressor import prediction_error
visualizer = prediction_error(pipe2, X_train, y_train, X_test, y_test)
from yellowbrick.regressor import residuals_plot
viz = residuals_plot(pipe2, X_train, y_train, X_test, y_test)